Design of a lean interface for Sanskrit corpus annotation

Authors

Gérard Huet

Pawan Goyal

Abstract

We describe an innovative computer interface designed to assist annotators in efficiently selecting segmentation solutions for the proper tagging of a Sanskrit corpus. The proposed solution uses a compact representation of the shared forest of all segmentations. The main idea is to represent the union of all segmentations, abstracting from the sandhi rules used and aligning on the input sentence. We show that this representation allows an exponential saving in both space and time. This interface has been implemented and applied to the annotation of the Sanskrit Library corpus.

1 Generalities on Sanskrit linguistics

Sanskrit is the primary culture-vehicle language of India. It has had a continuous production of literature in all fields of human endeavour over the course of four millennia, giving rise to an immense corpus which is to this date only partially digitized. It benefits from a very sophisticated linguistic tradition stemming from the fairly complete grammar composed by Pāṇini by the fourth century B.C.E.

During the last 15 years, significant effort has gone into developing Sanskrit computational linguistics, and considerable progress has been made in providing computer assistance for Sanskrit corpus processing (Scharf and Hyman, 2009; Huet et al., 2009; Kulkarni and Huet, 2009; Jha, 2010; Kulkarni et al., 2010; Kumar et al., 2010; Kulkarni and Shukl, 2009; Goyal et al., 2009; Hellwig, 2009; Goyal et al., 2012). Nevertheless, there does not yet exist a complete analyser for Classical Sanskrit texts able to compute morphological taggings reliably and fully automatically. The main difficulty concerns segmentation, since Sanskrit is represented in writing as a continuous phonetic enunciation, which demands complex processing to analyse it into separate word forms.
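The exponential saving claimed in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it ignores sandhi entirely and treats segmentation as plain string concatenation over a made-up lexicon. The function name `aligned_segments` and the two-letter input are hypothetical, chosen only to show how the number of full segmentations can grow exponentially while the set of distinct segments aligned on input positions grows linearly.

```python
from functools import lru_cache


def aligned_segments(text, lexicon):
    """Toy model: count all segmentations of `text` into lexicon words, and
    collect the (offset, word) pairs that occur in at least one segmentation.

    Assumptions (not from the paper): no sandhi, non-empty lexicon words,
    words simply concatenate to form the input.
    """
    n = len(text)

    @lru_cache(maxsize=None)
    def count_from(i):
        # Number of ways to split text[i:] into lexicon words.
        if i == n:
            return 1
        return sum(count_from(i + len(w))
                   for w in lexicon if text.startswith(w, i))

    # Offsets that can be reached from the start of the input.
    reachable = {0}
    for i in range(n):
        if i in reachable:
            for w in lexicon:
                if text.startswith(w, i):
                    reachable.add(i + len(w))

    # Keep a segment only if it starts at a reachable offset and the
    # remainder of the input can still be fully segmented.
    segments = sorted(
        (i, w)
        for i in reachable
        for w in lexicon
        if text.startswith(w, i) and count_from(i + len(w)) > 0
    )
    return count_from(0), segments


if __name__ == "__main__":
    lexicon = ("a", "b", "ab")  # hypothetical toy lexicon
    for k in (3, 10, 20):
        count, segments = aligned_segments("ab" * k, lexicon)
        print(k, count, len(segments))
```

With this toy lexicon, each repetition of `ab` can be read either as one word or as two, so an input of `k` repetitions has `2**k` segmentations but only `3 * k` aligned segments. An annotator choosing among aligned segments therefore faces a display that grows with the sentence length rather than with the number of solutions.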
Although complete algorithms for this segmentation preprocessing have been proposed (Huet, 2005), human assistance is still needed to single out the intended solution among all possible analyses. We propose in this paper a new human-machine interface to help a professional annotator decide quickly between all possible segmentations, in order to select a unique morphological analysis among the many possible ones. Indeed, there exist thousands of such segmentations for simple sentences, and literally billions for complex ones. Once a sufficient amount of tagged corpus has been produced with such semi-automated annotation tools, it is hoped that it can be used to train a fully automated parser using statistical methods.

2 Segmentation analysis

We are going to formalize the segmentation problem at various levels of abstraction. Firstly, we assume that Sanskrit text is represented as a list of phonemes. Sanskrit may be written in all Indian scripts, most usually in the Devanāgarī script used by languages of North India such as Hindi, but such a syllabic representation is awkward for morpho-phonetic computations, which operate at the phoneme level. It is thus preferable to translate the input into a list of phonemes, this translation being one-to-one. We assume the standard set of 50 phonemes, already known from the time of Pāṇini. Such low-level representation issues are discussed at length in (Scharf and Hyman, 2009;

Related resources

Annotating Sanskrit Corpus: Adapting IL-POSTS

In this paper we present an experiment on the use of the hierarchical Indic Languages POS Tagset (IL-POSTS) (Baskaran et al., 2008a, b), developed by Microsoft Research India (MSRI) for tagging Indian languages, for annotating a Sanskrit corpus. Sanskrit is a language with richer morphology and relatively free word order. The authors have included and excluded certain tags according to the require...

An annotation scheme for Persian based on Autonomous Phrases Theory and Universal Dependencies

A treebank is a corpus with linguistic annotations above the level of the parts of speech. During the first half of the present decade, three treebanks have been developed for Persian either originally or subsequently based on dependency grammar: Persian Treebank (PerTreeBank), Persian Syntactic Dependency Treebank, and Uppsala Persian Dependency Treebank (UPDT). The syntactic analysis of a sen...

SanskritTagger: A Stochastic Lexical and POS Tagger for Sanskrit

SanskritTagger is a stochastic tagger for unpreprocessed Sanskrit text. The tagger tokenises text with a Markov model and performs part-of-speech tagging with a Hidden Markov model. Parameters for these processes are estimated from a manually annotated corpus of currently about 1,500,000 words. The article sketches the tagging process, reports the results of tagging a few short passages of Sans...

An Effort to Develop a Tagged Lexical Resource for Sanskrit

In this paper we present our effort, the first of its kind in the history of Sanskrit, to design and develop a structured electronic lexical resource by tagging a traditional Sanskrit dictionary. We narrate how the whole unstructured raw text of Vaacaspatyam, an encyclopedic Sanskrit dictionary, has been tagged to form a user-friendly e-lexicon with structured and segregated informa...

Design & Analysis of an Exhaustive Algorithm for Sandhi Processing In Sanskrit

It is almost impossible to learn a new language without studying its grammar. Automated language processing is centrally focused on facilitating reference to the increasingly available Sanskrit e-texts. For learning the Sanskrit language, the study of its grammar plays a very important role. The proposed research paper presents a fresh approach to processing Sandhi...

Publication date: 2013
